Skip to content

Add a built-in RE2 implementation based on re2js#290

Open
jonbodner-buf wants to merge 18 commits intomainfrom
jonbodner/add_re2
Open

Add a built-in RE2 implementation based on re2js#290
jonbodner-buf wants to merge 18 commits intomainfrom
jonbodner/add_re2

Conversation

@jonbodner-buf
Copy link
Copy Markdown

The CEL spec says that its regular expressions meet the RE2 spec, but the existing ES implementation defaults to the regex implementation that's built into ES runtimes, which uses a backtracking implementation that has pathological cases susceptible to ReDOS attacks.

This PR adds a stripped-down version of RE2JS (https://github.com/le0pard/re2js) as a package, and integrates it into CEL as the default regex engine.

Copy link
Copy Markdown

@srikrsna srikrsna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments regarding the integration. I'll review the RE2 implementation in a second pass.

Comment thread packages/cel/src/checker.test.ts Outdated
Comment thread packages/cel/src/std/logic.ts Outdated
Comment thread packages/cel/src/std/logic.ts Outdated
/\\c[A-Z]/, // control character eg: /\cM\cJ/
/\\u[0-9a-fA-F]{4}/, // UTF-16 code-unit
/\\0(?!\d)/, // NUL
/\[\\b.*\]/, // Backspace eg: [\b]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this if it is re2?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

Comment thread packages/cel/src/std/logic.ts Outdated
Comment on lines +60 to +66
// can probably delete this since the RE2 engine will already reject them, but keep for now
for (const invalidPattern of invalidPatterns) {
if (invalidPattern.test(pattern)) {
throw new Error(
`Error evaluating pattern ${pattern}, invalid RE2 syntax`,
);
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should delete them

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

Comment thread packages/cel/src/std/logic.ts Outdated
}
const re = new RegExp(pattern, flags);
return re.test(this);
const re: RE2JS = RE2JS.compile(pattern, flagVal);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that flags are part of the syntax in RE2? Can we add support for them in the library instead of trying to identify them here? Or am I missing something?

cel-go also passes them directly to regex engine without any preprocessing: https://github.com/google/cel-go/blob/646cdc1728643aec9499e3a00236ef1007a5d3fa/common/types/string.go#L156

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, not needed. Removing the code.

Comment thread packages/cel/package.json Outdated
"@bufbuild/re2": "0.4.0"
},
"devDependencies": {
"@unicode/unicode-16.0.0": "^1.6.16",
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a peer dependency that is required?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, removed.

Comment thread packages/cel/package.json Outdated
"peggy": "^5.0.6",
"peggy-ts": "github:hudlow/peggy-ts#v0.0.9",
"expect-type": "^1.3.0"
"unicode-property-value-aliases": "^3.9.0"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a peer dependency that is required?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, removed.

Comment thread packages/re2/package.json
Comment on lines +46 to +47
"@unicode/unicode-16.0.0": "^1.6.16",
"unicode-property-value-aliases": "^3.9.0"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that these are added as dev dependencies in the cel package as well, will users end up needing them?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, they are used for tests and to build a unicode lookup table.

Comment thread package.json Outdated
Comment on lines +36 to +38
},
"dependencies": {
"@bufbuild/re2": "^0.1.0"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is needed here.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's needed? The cel package now depends on the re2 package to run.

Copy link
Copy Markdown

@srikrsna srikrsna Apr 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But this is the root workspace package.json not for the CEL package.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh! yes, this should go.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


// biome-ignore format: table
export default [
// !
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should revert the formatting changes here. It was deliberately formatted like this for readability

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, this wasn't intentional. I'll revert it.

Comment thread packages/cel/src/std/logic.ts Outdated
SemanticAdorner,
toDebugString,
} from "@bufbuild/cel-spec/testdata/to-debug-string.js";
} from "../../cel-spec/dist/cjs/testdata/to-debug-string.js";
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
} from "../../cel-spec/dist/cjs/testdata/to-debug-string.js";
} from "@bufbuild/cel-spec/testdata/to-debug-string.js";

Comment thread packages/re2/README.md

## Credits
This code is a fork of the [RE2JS](https://re2js.leopard.in.ua) project. It has been converted to TypeScript and has a feature set tailored for
CEL and Protovalidate-es. No newline at end of file
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it does anything special for protovalidate, does it?

Suggested change
CEL and Protovalidate-es.
CEL.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not yet, but when native rules are added to protovalidate-es, we will want to use the re2 engine for string pattern rules. Agree that we should remove that for now.

Comment thread packages/cel/src/std/logic.ts Outdated
pattern = pattern.substring(flagMatches[0].length);
}
const re = new RegExp(pattern, flags);
const re: RE2JS = RE2JS.compile(pattern);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seconding Krishna's comment - return RE2JS.compile(pattern).test(this); is idiomatic style.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment thread packages/cel/src/std/logic.ts Outdated
celMethod(olc.ENDS_WITH, STRING, [STRING], BOOL, String.prototype.endsWith),
celMethod(olc.STARTS_WITH, STRING, [STRING], BOOL, String.prototype.startsWith),
celMethod(olc.MATCHES, STRING, [STRING], BOOL, matches),
// !
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous formatting looks more readable to me, and there's a formatter ignore comment on line 86 to keep it. Any reason to change the formatting? If yes, let's drop the formatter ignore.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

format is back to the way it was (this was an unintentional change)

SemanticAdorner,
toDebugString,
} from "@bufbuild/cel-spec/testdata/to-debug-string.js";
} from "../../cel-spec/dist/cjs/testdata/to-debug-string.js";
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can break in subtle and mysterious ways. Need to keep the previous import.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm unclear. The current code is:

} from "../../cel-spec/dist/cjs/testdata/to-debug-string.js";

Do I change it to:

} from "@bufbuild/cel-spec/testdata/to-debug-string.js";

Or do I leave it as-is?

Comment thread packages/re2/tsconfig.json Outdated
Comment on lines +4 to +5
"include": ["src/**/*.test.ts"],
"exclude": ["./src/__tests__", "./src/__fixtures__", "./src/__utils__"]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is broken - the exclude means we aren't catching any compiler errors in tests, fixtures, or utils.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've moved the tests out of tests and the utils out of utils into src. The fixtures directory is now being scanned as well.

Comment thread packages/re2/.npmignore Outdated
Comment on lines +9 to +10
dist/*/testing.js
dist/*/testing.d.ts
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those two lines aren't needed here. The picture will change with a fixed tsconfig.json that includes tests and fixtures.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed


describe("bug-hunt verification", () => {
// Phase 1c: DFA.match ANCHOR_START with pos>0
test("executeEngine with ANCHOR_START and pos>0 finds substring match", () => {
Copy link
Copy Markdown

@le0pard le0pard May 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test expects ANCHOR_START to mean "anchor the match to the current pos index (3)". However, in RE2 semantics, ANCHOR_START strictly means "the match must start at the absolute beginning of the input string (index 0)". So test is incorrect

P.S. update re2js vendoring, because latest will not pass here


// Phase 1b: equalsIgnoreCase EOF handling
test("equalsIgnoreCase(-1, X) returns true per current implementation", () => {
assert.strictEqual(equalsIgnoreCase(-1, 0x41), true);
Copy link
Copy Markdown

@le0pard le0pard May 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-1 represents EOF. Case-insensitively comparing EOF to a valid character like A (0x41) should absolutely be false. So it is incorrect test case

P.S. update re2js vendoring, because latest will not pass here

Comment thread packages/re2/package.json
@@ -0,0 +1,49 @@
{
"name": "@bufbuild/re2",
"version": "0.4.0",
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re2js already 2.3.1. Before v2 version it will be slow, because have only Pike VM (NFA), no DFA engines - https://github.com/le0pard/re2js#re2js-vs-re2-node-c-bindings

"// returns stride-1 ranges. Surrogates are included — String.fromCodePoint",
"// returns the lone surrogate char and platform regex matches \\p{Cs} on it.",
"const sweepPlatform = (pattern: string): Uint32Array => {",
' const re = new RegExp(pattern, "u")',
Copy link
Copy Markdown

@le0pard le0pard May 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool idea about optimization, but just for info about problems to run on older js engines: the pattern passed into this function contains strings like \p{General_Category=L} or \p{Script=Latin}. Passing this to the native RegExp constructor alongside the "u" flag requires ES2018 support for Unicode property escapes. Older browsers, older versions of React Native, or legacy Node.js environments will immediately throw a SyntaxError here, crashing the script (which maybe not a problem, if not need support Node.js older than 10 version)

' const re = new RegExp(pattern, "u")',
" const ranges: number[] = []",
" let start = -1",
" for (let cp = 0; cp <= 0x10ffff; cp++) {",
Copy link
Copy Markdown

@le0pard le0pard May 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This for loop executes 1,114,112 times, allocating a string and executing a regex test on every single Unicode codepoint. Because JavaScript is single-threaded, this entirely blocks the main thread until the loop finishes. First-use CPU Spike, which blocking the main thread will happening here. That is why original re2js pays a penalty in bundle size by pre-compiling all arrays, but its compilation speed is consistently fast

"",
" const base = sweepPlatform(pattern)",
' const delta = kind === "category" ? _DELTA_CATEGORIES.get(name) : _DELTA_SCRIPTS.get(name)',
" const merged = delta ? mergeRanges(base, delta) : base",
Copy link
Copy Markdown

@le0pard le0pard May 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for information: the delta mechanism assumes the host is running at least Unicode 15.0. If the host environment is running an ancient version of V8/Node with Unicode 12.0 or 13.0, applying the 15.0 -> 16.0 delta on top of the host's 12.0 base will result in an inaccurate and broken Unicode table (which maybe not a problem, if not need support Node.js older than 18 version)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants